- This is a collection of fast chunky-to-planar (c2p) routines implemented in Blitz
- Basic for use in any software, including commercial and shareware.
-
- There are five standard c2p's, each in two versions: one does a normal c2p
- operation and the other does the c2p and a clearscreen at the same time
- (25-30% faster than a separate clearscreen). There is a sixth c2p that is of a
- different design and has special requirements.
-
- c2p030only : Only for use on 68030 cpu's. For 68030 users, this c2p
- will perform better than all the others.
-
- c2p030onlyCLS: As above, except that it also clears (to a given longword)
- the chunky buffer that it has just read data from.
-
- c2p040only : Only for use on 68040 cpu's, but performs very well on
- anything higher. For 68040 users, this c2p will perform
- better than all the others.
-
- c2p040onlyCLS: As above, except that it also clears (to a given longword)
- the chunky buffer that it has just read data from.
-
- c2p060only : Only for use on 68060 cpu's. It is not, however, the fastest
- and you will find that c2p040only and c2pCACHE are faster.
- Probably does not perform very well on anything lower than
- an 060.
-
- c2p060onlyCLS: As above, except that it also clears (to a given longword)
- the chunky buffer that it has just read data from.
-
- c2pGeneric : A generic c2p for use on all cpu's if it is not possible
- to isolate the cpu model or to use separate c2p's for
- different cpu's. Performs well on 030 but is somewhat slower
- than dedicated routines on higher processors. Mainly to provide
- support for 030, as higher processors will be crippled.
-
- c2pGenericCLS: As above, except that it also clears (to a given longword)
- the chunky buffer that it has just read data from.
-
- c2p040plus : A kind of generic routine for 040 or higher. Performs generally
- quite well on all 040 and 060 cpu's but is not as fast as
- dedicated c2p's. Not suitable for 030's.
-
- c2p040plusCLS: As above, except that it also clears (to a given longword)
- the chunky buffer that it has just read data from.
-
- c2pCACHE : Designed for use on anything from a 68040 upwards. Second
- fastest on 040/25 but joint fastest on 060/50. Perhaps more
- geared towards 68060 than anything lower. This c2p is less
- flexible and requires special treatment.
-
- c2pCACHECLS : This does not exist as it is not possible to meddle with the
- way that the routine specially handles datacaches, which would
- result if there were any additional writing to memory such
- as when a clearscreen is performed.
-
- Some general performance times for the routines are as follows. These times
- are inclusive of having a screen open and being displayed, of the specified
- dimensions (* indicates best suitability for the given c2p):
-
- c2p030only   : On 68030/50Mhz PAL
-  68030 only    * 320x200 @40.4fps
-                * 320x256 @30.7fps
-
-                On 68040/25Mhz DoublePAL
-                  320x200 @42fps
-                  320x256 @31fps
-
-                On 68040/25Mhz PAL
-                  320x200 @44fps
-                  320x256 @36.5fps
-
- c2p030onlyCLS: On 68040/25Mhz PAL
-  68030 only      320x200 @38.5fps
-                  320x256 @29.7fps
-
-
- c2p040only   : On 68030/50Mhz PAL
-  68040           320x200 @28.2fps
-  to              320x256 @21.6fps
-  68060
-
-                On 68040/25Mhz DoublePAL
-                * 320x200 @49.6fps
-                * 320x256 @36.2fps
-
-                On 68040/25Mhz PAL
-                * 320x200 @55.3fps
-                * 320x256 @42.5fps
-
-                On 68060/50Mhz PAL
-                * 320x200 @66.1fps
-                * 320x256 @50fps
-
- c2p040onlyCLS: On 68040/25Mhz PAL
-  68040         * 320x200 @49fps (separate clearscreen ran about 45-46fps)
-  to            * 320x256 @37.1fps
-  68060
-
- c2p060only   : On 68030/50Mhz PAL
-  68060 only      320x200 @27.9fps
-                  320x256 @21.5fps
-
-                On 68040/25Mhz DoublePAL
-                  320x200 @46.0fps
-                  320x256 @34.2fps
-
-                On 68040/25Mhz PAL
-                  320x200 @48.2fps
-                  320x256 @37.4fps
-
-                On 68060/50Mhz PAL
-                * 320x200 @66fps
-                * 320x256 @50fps
-
- c2p060onlyCLS: On 68040/25Mhz PAL
-  68060 only      320x200 @44.5fps
-                  320x256 @33.8fps
-
-
- c2pGeneric   : On 68030/50Mhz PAL
-  all, but      * 320x200 @40.1fps
-  mainly 68030  * 320x256 @30.7fps
-
-                On 68040/25Mhz DoublePAL
-                  320x200 @42fps
-                  320x256 @31fps
-
-                On 68040/25Mhz PAL
-                  320x200 @44fps
-                  320x256 @34fps
-
- c2pGenericCLS: On 68040/25Mhz PAL
-  all, but        320x200 @38.8fps
-  mainly 68030    320x256 @29.7fps
-
-
- c2p040plus   : On 68030/50Mhz PAL
-  68040           320x200 @24.3fps
-  to              320x256 @18.5fps
-  68060
-
-                On 68040/25Mhz DoublePAL
-                * 320x200 @46fps
-                * 320x256 @34.2fps
-
-                On 68040/25Mhz PAL
-                * 320x200 @49.2fps
-                * 320x256 @37.9fps
-
-                On 68060/50Mhz PAL
-                * 320x200 @66fps
-                * 320x256 @50fps
-
- c2p040plusCLS: On 68040/25Mhz PAL
-  68040         * 320x200 @45.6fps
-  to            * 320x256 @35fps
-  68060
-
-
- c2pCACHE     : On 68030/50Mhz PAL
-  68040           320x200 @23.5fps
-  to              320x256 @18.0fps
-  68060
-
-                On 68040/25Mhz DoublePAL
-                * 320x200 @47.1fps
-                * 320x256 @35.3fps
-
-                On 68040/25Mhz PAL
-                * 320x200 @50fps
-                * 320x256 @38.3fps
-
-                On 68060/50Mhz PAL
-                * 320x200 @66.1fps
-                * 320x256 @49.6fps
-
-
- For 68030 owners, do not use c2p040plus, c2p040only, c2p060only, or c2pCACHE.
- These will give very bad performance on that cpu.
-
- All of the routines except for c2pCACHE allow you to specify the size of the
- chunky-to-planar operation by way of a c2pRoutineInit{} statement, where
- `Routine' is the name of the routine (e.g. c2p040onlyInit{}). If you alter the
- size of the c2p operation you should generally also alter the size of your planar
- destination bitmap to be equal.
-
- It is, however, possible to have a taller planar bitmap than the height of the
- chunky-to-planar operation. #c2pBPLSIZE has to be altered to reflect this. The
- planar height must always be equal to or greater than the chunky height.
-
- Each c2p routine has two inputs. The first parameter is the address of the chunky
- buffer and the second parameter is the address of the planar buffer. Planar
- memory must be contiguous so I suggest initialising a bank or reserving some
- memory, and then using CludgeBitmap. The inputs to the init statements are the
- width and height of the chunky buffer, hence the size of the c2p operation.
- The init routine only needs to be called once in a program for any number of c2p
- calls.
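-
- To make the inputs concrete, here is a slow reference model of what a c2p call
- does, written in Python rather than Blitz or 68k assembly. It is only a sketch:
- the name c2p_reference and the planar_height parameter are made up for
- illustration, and it assumes 8 non-interleaved bitplanes stored one after
- another, with the leftmost pixel in the top bit of each byte. The real routines
- are far faster and may order their work quite differently.
-
```python
def c2p_reference(chunky, width, height, planar_height=None, depth=8):
    """Convert one-byte-per-pixel chunky data into contiguous bitplanes.

    Hypothetical reference model only: it assumes non-interleaved bitplanes
    laid out one after another in the returned buffer.
    """
    if planar_height is None:
        planar_height = height
    if width % 8 or planar_height < height:
        raise ValueError("width must be a multiple of 8 and planar_height >= height")

    row_bytes = width // 8                    # one bit per pixel in each plane
    plane_bytes = row_bytes * planar_height   # size of a single bitplane
    planar = bytearray(plane_bytes * depth)

    for y in range(height):
        for x in range(width):
            pixel = chunky[y * width + x]
            byte_index = y * row_bytes + x // 8
            bit = 0x80 >> (x % 8)             # leftmost pixel is the top bit
            for plane in range(depth):
                if pixel & (1 << plane):
                    planar[plane * plane_bytes + byte_index] |= bit
    return planar
```
-
- The planar_height argument models a planar bitmap that is taller than the
- chunky buffer: lines below the chunky height are never written.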
-
- c2pCACHE is different in that you must specify operation size in constants which
- cannot easily be altered during the running of the program, so you are restricted
- to one size of operation per program run.
-
- All you have to do to set up a c2p operation is something along these lines (for
- example):
-
- InitBank 2,320*256,$10000 ; Fastram chunky buffer
- InitBank 0,320*256,$10002 ; Chipram planar buffer
- CludgeBitmap 0,320,256,8,Bank(0)
- c2pGenericInit{320,256}
- c2pGeneric{Bank(2),Bank(0)}
-
- Of course, replace the c2pGeneric statements with the ones for the relevant c2p
- that you are using.
-
- The only exception to this is c2pCACHE. This requires that you cludge bitmaps to
- 8 bytes past the start of the planar buffer, and that you tell the c2pCACHE
- routine to output to an address 4 bytes past the start of the planar buffer. So
- you have to allow for this by reserving a little extra memory. Like
- this:
-
- InitBank 2,320*256,$10000 ; Fastram chunky buffer
- InitBank 0,(320*256)+8,$10002 ; Chipram planar buffer
- CludgeBitmap 0,320,256,8,Bank(0)+8
- c2pCACHE{Bank(2),Bank(0)+4}
-
- As well as c2pCACHE having to be set up with constants, there is also no
- clearscreen version because it is not possible to implement it due to the nature
- of the way the c2p works.
-
- Generally you should ensure that the base address of a planar bitmap's bitplane
- data is aligned to a 64-pixel (8-byte) boundary. Reserving some memory with
- AllocMem or InitBank usually seems to do this very reliably. c2pCACHE requires
- that you create bitmaps at 8 bytes past the start of the data, and that you begin
- the c2p operation at 4 bytes past the start. This is to ensure that the data
- being displayed is 64-bit aligned; otherwise you would get a lower datafetch.
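-
- The address arithmetic for c2pCACHE can be sanity-checked with a few lines of
- Python. This is only an illustration: the function names are hypothetical,
- addresses are treated as plain integers, and 8 bitplanes are assumed so that the
- planar data needs width*height bytes plus the 8 spare bytes.
-
```python
def is_64bit_aligned(address):
    """True if the address sits on an 8-byte (64-pixel) boundary."""
    return address % 8 == 0


def c2pcache_layout(planar_base, width, height):
    """Work out the c2pCACHE addresses for a planar buffer at planar_base."""
    return {
        "buffer_bytes": width * height + 8,   # 8 bitplanes plus 8 spare bytes
        "c2p_output": planar_base + 4,        # address handed to c2pCACHE{}
        "bitmap_data": planar_base + 8,       # address handed to CludgeBitmap
    }
```
-
- If the base address returned by InitBank or AllocMem is 64-bit aligned, then
- base+8 is as well, so the data being displayed stays 64-bit aligned.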
-
- In amigamode, if the first longword of data that is being displayed is from a
- 64-bit aligned address, the o/s will use 64-bit datafetch which means faster
- chunky-to-planar conversion. If you begin to scroll the display with hardware
- scrolling and you go beyond 32 pixels, the first longword being displayed will no
- longer be 64-bit aligned, and so the o/s will automatically switch to fetchmode 1
- or 2 (32-bit datafetch), which will slow down the c2p. Worse still is fetchmode
- 0: the o/s will not normally use it, but if you set the datafetch yourself YOU
- should make sure that you do NOT use datafetch 0, because that will at least
- double the time it takes to do the c2p operation, and that is bad news.
-
- To do scrolling with chunky screens it is not normally the best idea to use
- hardware scrolling. The c2p's do not have a line modulo so you would have to make
- your planar bitmap 64 pixels wider which means a further 64x200 or 64x256 area to
- be converted. This is also a waste because one longword in chunky is only 4
- pixels, so a hardware scroll of 0..3 is normally all that is required. So the
- remaining 60 pixels are a total waste. As such, I recommend using software
- scrolling and generally speaking, if you have enough power to use
- chunky-to-planar well then you should also be thinking of refreshing the whole
- screen every frame rather than any of the traditional scroll methods. Taking a
- leap to using chunky is also to take a leap towards other factors which come as
- part of the package. Screens are normally fully refreshed each frame, scrolling
- is done in software, blits are done with the cpu and generally there is cpu
- horsepower to back this all up. 030/50's are generally going to be a little
- limited in what can be achieved with a decent screensize. I suggest 040/25 is the
- entry-level for chunky-to-planar equipped software, unless you have direct output
- to a graphics card which does not therefore require any data-conversion.
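-
- One way to picture software scrolling with a full refresh is to rebuild the
- chunky buffer each frame from a larger source image. The Python sketch below
- assumes such a "world" buffer exists in fastram (the archive does not provide
- one) and uses a hypothetical helper name, copy_window. It does no clipping.
-
```python
def copy_window(world, world_width, scroll_x, scroll_y, width, height):
    """Copy a width x height window out of a larger chunky image.

    Hypothetical helper: `world` is a one-byte-per-pixel image that is
    world_width pixels wide, and the window must lie fully inside it.
    """
    chunky = bytearray(width * height)
    for y in range(height):
        start = (scroll_y + y) * world_width + scroll_x
        chunky[y * width:(y + 1) * width] = world[start:start + width]
    return chunky
```
-
- Because the whole window is rebuilt every frame, the scroll position can change
- by any number of pixels and the planar bitmap never needs to be wider than the
- display.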
-
- For a purely generic setup, use the c2pGeneric routine. It will, however, be
- quite crippling to 040 or 060 processors but will better support the low end. To
- take things one stage further, use also c2p040plus which is a generic routine for
- 040-060 cpu's. To take it to the next level you should be looking to have a
- specific routine for each cpu. For 030's use c2p030only and for 040's use
- c2p040only. It seems that c2p040only is actually faster than c2p060only when
- running on a 68060/50, but there is hardly anything in it so take your pick.
- c2pCACHE is another replacement possibility for c2p040plus and is faster but less
- flexible. Certainly you don't need to support ALL of the c2p's in your software
- as there is quite a lot of overlap and it may come down to personal taste.
-
- Personally I would use c2p030only for anything below 040, and c2p040only for
- anything from 040 upwards. If I had to choose ONE generic c2p I would go for
- c2p040plus as it performs slightly better than c2pCACHE when used on 030's,
- although either of them on 030 are pretty poor, so I would generally target the
- software at 040 upwards.
-
- Generally speaking, the clearscreen routines save you between 3 and 5 frames per
- second compared with having to do a separate clearscreen routine. Time is mainly
- gained by minor pipelining and the fact that the c2p routine is already handling
- and setting up the loop. All that has been done to facilitate the clearscreen is
- that (in most cases) a7 is loaded with #clearscreento, which is a longword, and
- then move.l (a0)+,Dn is converted to move.l (a0),Dn : move.l a7,(a0)+ ; or if it
- is the 030only or generic routine, then it has been converted to move.l (a0),Dn :
- move.l #clearscreento,(a0)+, because those routines do not have a7 spare.
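-
- The read-then-clear idea can be modelled in Python as below. This is a sketch
- of the memory traffic only, not of the c2p itself: read_and_clear is a
- hypothetical name, longwords are read big-endian as on the 68k, and the comments
- show which instruction each step stands for.
-
```python
def read_and_clear(chunky, clear_to):
    """Return every longword of `chunky`, overwriting each one after reading it."""
    if len(chunky) % 4:
        raise ValueError("chunky buffer length must be a multiple of 4")
    fill = clear_to.to_bytes(4, "big")
    values = []
    for offset in range(0, len(chunky), 4):
        # move.l (a0),Dn : fetch the pixels without advancing the pointer
        values.append(int.from_bytes(chunky[offset:offset + 4], "big"))
        # move.l a7,(a0)+ : write the clear value back, then advance
        chunky[offset:offset + 4] = fill
    return values
```
-
- The buffer is walked once instead of twice, which is where the saving over a
- separate clearscreen comes from.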
-
- It is possible to do a screencopy at the same time as the c2p, but this is not
- feasible on c2p030only or c2pGeneric as there needs to be a spare register.
- Therefore, a screencopy in place of the clearscreen (which will also do the same
- thing as a clearscreen, effectively) is only viable on 040 upwards, and judging
- by the time it takes it may be better suited to 060 only. It may therefore be
- faster to do a separate screencopy, perhaps hardcoded and perhaps using move16,
- which may equal or beat the time saved by doing the screencopy at the same time
- as the c2p. I HAVE done a screencopy test using c2p040only, in which
- move.l (a0)+,Dn has been converted to move.l (a0),Dn : move.l (a7)+,(a0)+ ; and
- it seems to perform @44.3fps for 320x200, or @34.1fps for 320x256 (040/25
- results). The copy costs a further 3fps or so on top of the c2pCLS time, or
- about 9-10fps compared with a c2p that does not do anything additional
- (c2p040only). If you can do a screencopy separately, perhaps using movem or
- move16, faster than this on 040/25, then I suggest you do that rather than
- modify the c2p. Judging by the time it takes and
- the number of chunky blits it gobbles up I would suggest that fullscreen copy is
- not very viable on anything lower than 040 and is probably questionable on
- 040/25.
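-
- Framerates are awkward to compare directly, so it can help to turn them into
- milliseconds per frame. The small Python helper below (a hypothetical name, not
- part of the archive) does that for the 040/25 figures quoted in this document.
- Since those framerates include everything else the test program was doing,
- treat the result as a rough budget and not as a precise timing.
-
```python
def extra_ms_per_frame(fps_plain, fps_with_extra):
    """Milliseconds per frame added by the extra work (e.g. a screencopy)."""
    return 1000.0 / fps_with_extra - 1000.0 / fps_plain


# 040/25 PAL figures from this document: c2p040only alone vs. with screencopy
cost_320x200 = extra_ms_per_frame(55.3, 44.3)
cost_320x256 = extra_ms_per_frame(42.5, 34.1)
```
-
- On these figures the combined screencopy costs roughly 4.5ms per frame at
- 320x200 and roughly 5.8ms at 320x256, so a separate copy would need to beat
- that on 040/25 to be worth doing.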
-
- If you have a horizontal strip at the top or bottom or even middle of your
- display that does not need to be clearscreened and yet is updated a lot, use a
- clearscreening c2p for the main game area and a non-clearscreening one for the
- panel area.
-
- When it comes to chunky blitting you need to take into account the processor you
- are working on. If you have anything from 68040 upwards it is faster to have mask
- data (same size as the graphic) and to write longwords to non-aligned addresses,
- than to try generating the mask on-the-fly. The code:
- move.l (a2)+,d0 : and.l d0,(a1) : move.l (a0)+,d1 : or.l d1,(a1)+ ; will do one
- longword of masked blit to anywhere on the screen, about 2-3 times faster than if
- you try to generate the mask from the source data. Also, writing longwords to odd
- addresses is not supported on the 68000 (it causes an address error), although
- the 68020 and above allow it. With a copyback cache it is very quick, so that
- masked blits are only about 30% slower than unmasked ones. If you are not going
- to write to non-aligned addresses you have to do shifting or rotating in the cpu,
- which if using mask data means the mask as well. This takes further time. But
- these memory-intensive methods may not prove to be quite so efficient on 030's as
- they do not have a copyback cache.
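-
- The masked blit boils down to "AND the destination with the mask, then OR in the
- source", one longword at a time. The Python model below is a sketch under
- assumptions: the function names are hypothetical, the mask holds $FF for
- transparent pixels and $00 for solid ones, and transparent pixels are zero in
- the graphic. Like the demo, it does no clipping.
-
```python
def masked_blit_longword(dest, offset, mask, source):
    """Apply dest = (dest AND mask) OR source to the 4 bytes at `offset`."""
    old = int.from_bytes(dest[offset:offset + 4], "big")
    new = (old & mask) | source
    dest[offset:offset + 4] = new.to_bytes(4, "big")


def masked_blit(dest, dest_width, x, y, source, mask, obj_width, obj_height):
    """Blit an object whose width is a multiple of 4, with an x loop inside a y loop."""
    for row in range(obj_height):
        for col in range(0, obj_width, 4):
            src = row * obj_width + col
            masked_blit_longword(
                dest,
                (y + row) * dest_width + x + col,          # may be a non-aligned address
                int.from_bytes(mask[src:src + 4], "big"),    # (a2)+ in the 68k code
                int.from_bytes(source[src:src + 4], "big"),  # (a0)+ in the 68k code
            )
```
-
- Note that x does not have to be a multiple of 4: the longword is written
- straight to whatever byte offset the object sits at, which is the non-aligned
- write the text refers to.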
-
- I did not write any of the c2p's myself, only the minor modifications and the
- example program. You are free to use them all in any of your productions,
- freeware, shareware, and even commercialware. I hope you are thankful to those
- talented few that have mastered their craft in making these c2p's and for
- releasing them for public use.
-
- Please find also enclosed in this archive a demonstration program. There is an
- 040 version for 040-to-060, and an 030 version. This program will use a
- clearscreening c2p and will bounce a number of chunky cpu-blitted objects around
- the screen. The blit routines do not do any clipping and the loop for movement
- and rendering of the objects is currently hardcoded into a single statement. This
- is quite a bit faster than calling a statement for every object.
-
- There are some constants which you can alter. The demonstration program has the
- facility to have a planar bitmap height larger than the chunky bitmap height. You
- must not allow it to be smaller, however! #planarheight should be >= #c2pBPLY.
- If you use: #c2pBPLY=200 : #planarheight=256 ; the routine will do a c2p
- operation on the first 200 lines and leave the bottom 56 lines as they are. This
- shows how you can use the vertical modulo should you need to.
-
- There are other constants to alter. #iterations is how many loops will be done
- before the program exits. #objcount is the number of cpu-blit objects that will
- be moved and drawn. Refer to the example 040/25 results for guidelines as to what
- to set this at. #objwidth and #objheight are the size of the objects. They must
- not be larger than the size of the chunky buffer and preferably should be at
- least about 16 pixels smaller in both dimensions for the movement routine to work
- properly. #objwidth must be a multiple of 4 and must not be smaller than 4.
- #objheight does not need to be a multiple of anything but must not be smaller
- than 1. The routine currently has constants which will render 85 32x32 256-colour
- masked objects with a screen size of 320x240. You should use PAL in preference to
- DoublePAL if you want a higher framerate. 320x200 will yield even higher results.
- If you alter the chunky height don't forget about the planarheight.
-
- #objmasking should be set to either 0 or nonzero, which means you can use
- anything other than 0 (1, -1, 20, -50, etc). If zero, there will be no masking
- performed and you will attain higher output, but all objects will be solid. If
- objmasking is nonzero, there will be masking and any zero pixels will be
- transparent. The routine will default to using masking. The mask routine
- uses a prerendered mask image, similar to planar masks, except that it is a byte
- for every pixel. This is unavoidable if there is to be such speedy processing. It
- is perfectly feasible to generate the mask differently from the graphic data so that
- any number of colours will be transparent. Don't forget that if you specify an
- area as solid when it is blank in the graphic, the blank pixel will be drawn.
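-
- As an illustration of building a byte-per-pixel mask, here is a short Python
- sketch. The name make_mask is hypothetical, and the polarity is an assumption
- ($FF keeps the background, $00 lets the graphic through), chosen to suit an
- AND-then-OR style of blit. With that style of blit, any colour you make
- transparent must also be zeroed in the graphic, or it would be ORed onto the
- background, so the sketch returns a cleaned copy of the graphic as well.
-
```python
def make_mask(graphic, transparent=(0,)):
    """Return (mask, cleaned_graphic) for a one-byte-per-pixel graphic.

    Assumed polarity: mask byte 0xFF = transparent, 0x00 = solid.
    """
    mask = bytearray(len(graphic))
    cleaned = bytearray(graphic)
    for i, pixel in enumerate(graphic):
        if pixel in transparent:
            mask[i] = 0xFF      # keep whatever is already on screen
            cleaned[i] = 0      # nothing to OR in for this pixel
    return bytes(mask), bytes(cleaned)
```
-
- Passing transparent=(0, 4), for example, makes both colour 0 and colour 4 see
- through.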
-
- There is very little difference between the masked and unmasked routines. You
- will notice that the masked routine could be simplified as and.l (a2)+,(a1) :
- or.l (a0)+,(a1)+ ; but this is illegal in 68000 so I have had to expand it a
- little. Currently both routines will allow total flexibility in terms of width
- and height (width to nearest 4 pixels), and so use an x loop nested inside a y
- loop. If you expand the x loop for a hardcoded version you will get more output,
- and similarly with the y loop, although hardcoded large objects take up too much
- space to work in cache. Whereas it is possible to do 900 8x8 masked objects on
- 040/25, it is possible to do 1100 if the routine is hardcoded for 8x8 with both
- loops fully unfolded (ie no loops). The larger the objects you use the less
- intermediary time is used in setting them up. Lots of small objects take a lot of
- processing of the movement table. Typically, there is time to do about two and a
- half 320x200 screenfuls of blitting in the time left after the clearscreening
- c2p, on 040/25. The objects that the demo routine uses are generated when you run
- the program so they are only basically random pixels and the palettes are fairly
- random too.
-
- I have also added a second demo program, which is the same as the first except
- that it is dynamic in the number of objects that it displays. You set it a target
- frames-per-second rate that you want to know results for. You tell it how many
- objects to start off with (must be greater than 0) and how many objects to add
- each time (greater than 0). Iterations in the second demo represent how many
- loops to do before adding more objects, and this should not be too low or the
- routine won't work out whether it's reached the target framerate properly or not.
- You have to set a maximum number of objects, because the table has to be
- initialised for the eventuality of that many objects becoming displayed. I set it
- to 3000 initially which is more than enough for most usual object sizes on all
- cpu's.
-
- There are versions of the demo for 030 and for 040(+) as before. You set the
- program running and it will progressively add objects and move them and will
- keep doing so until it reaches the target framerate. Then it will tell you what
- precisely the framerate was at the end and how many objects it managed to display
- at that rate. So instead of having to do tests on some constant number of
- objects, you can let the routine chug away adding more and more until you are
- maxed out for the selected framerate. The demo will default to doing 16x16
- objects, starting with 10, and adding 10 more every 40 loops. You can set the
- starting number of objects (objcount) to a value much closer to what you expect
- will be the end result, in order to hasten the report. The starting number of
- objects should never exceed the maximum number, however, or it will probably
- start drawing objects everywhere in memory (65536x65536!).
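-
- The search that the second demo performs can be sketched in Python as follows.
- This is a model of the idea and not the demo's actual code: find_object_limit is
- a hypothetical name, and measure_fps stands in for "run the given number of
- objects for #iterations loops and report the framerate". The sketch refuses a
- starting count above the maximum instead of drawing into random memory.
-
```python
def find_object_limit(measure_fps, target_fps, start=10, step=10, max_objects=3000):
    """Add objects until the framerate falls to target_fps or the table is full.

    measure_fps(count) is a caller-supplied function returning the framerate
    achieved with `count` objects. Returns (object_count, framerate).
    """
    if not 0 < start <= max_objects or step <= 0:
        raise ValueError("need 0 < start <= max_objects and step > 0")
    count = start
    while True:
        fps = measure_fps(count)
        if fps <= target_fps or count == max_objects:
            return count, fps
        count = min(count + step, max_objects)
```
-
- Starting nearer the expected answer shortens the search, exactly as with
- objcount in the demo.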
-
- Unless you are particularly fussy you do NOT need to have a planar doublebuffer
- when using the c2p's, so long as they are running quite fast, ie that the overall
- routine is not slowing below 25fps, or not much anyway. There will be a slight
- flicker on one or two lines of the display, perhaps, but it will not be the
- full-screen type of flickering that you can get on planar. Yes, the c2p's are
- outputting to planar but the way they do it seems to minimise flicker. I
- personally do not use a doublebuffer and I hardly notice that I haven't. When
- you're in the middle of all the action you won't notice either so long as things
- don't slow down too much. Even if the overall routine slows down the c2p will
- still take the same amount of time so it should be okay. So you can probably cut
- out the time it takes to do screen swapping or other doublebuffer methods. Of
- course, if you have a graphics card, it is fairly normal to refresh straight into
- the display and people have reported that there is little or no flicker
- whatsoever.
-
- I would be interested to know how any of these routines perform on your
- specification of Amiga, and particularly how well the clearscreening c2p's do on
- 68060's, ie, does it clearscreen `for free'. Problems or ideas, give me a yell.
-
- If you get any problems implementing or using or adapting or modifying the
- routines, email me at paul@stationone.demon.co.uk
-
- Enjoy.
-